Introduction
Toward heterogeneous multi-core architectures

* Multicore is here
  - Hierarchical architectures
  - Manycore
  - Mixed large and small cores

* Architecture specialization
  - Now
    - Accelerators (GPGPUs, FPGAs)
    - Coprocessors (Xeon Phi)
    - Fusion
    - DSPs
    - All of the above
  - In the near future
    - Many simple cores
    - A few full-featured cores


How to program these architectures? 


Multicore 


* Multicore programming
  - pthreads, OpenMP, TBB, ...
How to program these architectures?

Accelerators

* Accelerator programming
  - Consensus on OpenCL/OpenACC?
  - (Often) pure offloading model


OpenMP 


A portable approach to shared-memory programming 


* Extension to existing languages
  - C, C++, Fortran
  - Set of programming directives

* Fork/join approach
  - Parallel sections

* Well suited to data-parallel programs
  - Parallel loops

* OpenMP 3.0 introduced tasks
  - Support for irregular parallelism

    int matrix[MAX][MAX];

    #pragma omp parallel for
    for (int i = 0; i < 400; i++)
    {
        matrix[i][0] += ...;
    }
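As a minimal, compilable sketch of the parallel-loop pattern on this slide: the function name `add_to_first_column`, the value of `MAX` and the `delta` argument are assumptions made for illustration, standing in for the elided right-hand side of the update.

```c
#include <assert.h>

#define MAX 512  /* assumed matrix dimension; the slide leaves MAX unspecified */

/* Adds `delta` to column 0 of the first `rows` rows of `matrix`.
 * Each iteration touches a distinct row, so the loop has no
 * cross-iteration dependency and can be split among threads. */
void add_to_first_column(int matrix[MAX][MAX], int rows, int delta)
{
    assert(rows >= 0 && rows <= MAX);

    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        matrix[i][0] += delta;
    }
}
```

When built without OpenMP support the pragma is ignored and the loop simply runs sequentially, which is part of the appeal of the directive-based approach.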


How to program these architectures? 


Accelerators 


* Accelerator programming
  - OpenMP extension

    int matrix[MAX][MAX];

    #pragma omp target device(acc0) map(matrix)
    #pragma omp parallel for
    for (int i = 0; i < 400; i++)
    {
        matrix[i][0] += ...;
    }

  - Still quite hand-tuned


How to program these architectures? 


Accelerators 


* Accelerator programming
  - OpenACC

    int matrix[MAX][MAX];

    #pragma acc kernels copy(matrix)
    for (int i = 0; i < 400; i++)
    {
        matrix[i][0] += ...;
    }

  - Again quite hand-tuned


How to program these architectures? 


Multicore Accelerators 



* Hybrid models?
  - Take advantage of all resources
  - Complex interactions and distribution


Task graphs 


* Well-studied expression of parallelism

* Departs from usual sequential programming

Really?


Task management 


Implicit task dependencies 


* Right-Looking Cholesky decomposition (from PLASMA)

    for (j = 0; j < N; j++) {
        POTRF (RW,A[j][j]);
        for (i = j+1; i < N; i++)
            TRSM (RW,A[i][j], R,A[j][j]);
        for (i = j+1; i < N; i++) {
            SYRK (RW,A[i][i], R,A[i][j]);
            for (k = j+1; k < i; k++)
                GEMM (RW,A[i][k],
                      R,A[i][j], R,A[k][j]);
        }
    }
    task_wait_for_all();

[Figure: task graph of the submitted kernels]
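One common way to derive such implicit dependencies is to replay the sequential submission order and compare the access modes (R or RW) declared on each piece of data. The sketch below illustrates that idea only; `dep_tracker`, `task_begin` and `task_access` are hypothetical names, not StarPU functions, and the fixed-size tables are a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TASKS   64
#define MAX_HANDLES 16

enum access_mode { MODE_R, MODE_RW };

struct dep_tracker {
    int  ntasks;
    int  last_writer[MAX_HANDLES];                    /* -1: never written */
    bool reader_since_write[MAX_HANDLES][MAX_TASKS];  /* readers of the current version */
    bool depends[MAX_TASKS][MAX_TASKS];               /* depends[t][u]: t must wait for u */
};

void tracker_init(struct dep_tracker *tr)
{
    memset(tr, 0, sizeof *tr);
    for (int h = 0; h < MAX_HANDLES; h++)
        tr->last_writer[h] = -1;
}

/* Starts a new task in submission order and returns its identifier. */
int task_begin(struct dep_tracker *tr)
{
    assert(tr->ntasks < MAX_TASKS);
    return tr->ntasks++;
}

/* Declares that `task` accesses `handle` with the given mode. */
void task_access(struct dep_tracker *tr, int task, int handle,
                 enum access_mode mode)
{
    assert(handle >= 0 && handle < MAX_HANDLES);

    /* Read-after-write or write-after-write: wait for the last writer. */
    int writer = tr->last_writer[handle];
    if (writer >= 0 && writer != task)
        tr->depends[task][writer] = true;

    if (mode == MODE_R) {
        tr->reader_since_write[handle][task] = true;
        return;
    }

    /* Write-after-read: wait for every reader of the previous version. */
    for (int r = 0; r < tr->ntasks; r++) {
        if (tr->reader_since_write[handle][r] && r != task)
            tr->depends[task][r] = true;
        tr->reader_since_write[handle][r] = false;
    }
    tr->last_writer[handle] = task;
}
```

Submitting POTRF(RW A[0][0]), then TRSM(RW A[1][0], R A[0][0]), then SYRK(RW A[1][1], R A[1][0]) through this tracker yields the chain POTRF, TRSM, SYRK without the application ever naming a dependency explicitly.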
How to program these architectures?

Multicore Accelerators

* A uniform way
  - Use a single (or a combination of) high-level programming language to deal with network + multicore + accelerators
  - Increasing number of directive-based languages
    - Use simple directives... and good compilers!
    - XcalableMP
    - HMPP
    - StarSs
  - Much better potential for composability
    - If compiler is clever!
Challenging issues at all stages

* Applications
  - Programming paradigm
  - BLAS kernels, FFT, ...

* Compilers
  - Languages
  - Code generation/optimization

* Runtime systems
  - Resources management
  - Task scheduling

* Architecture
  - Memory interconnect

[Figure: software stack, top to bottom: HPC applications; compiling environment and specific libraries; operating system; hardware. Annotations: "Expressive interface", "Execution feedback"]


Overview of StarPU 


Rationale 


* Task scheduling
  - Dynamic
  - On all kinds of PU
    - General purpose
    - Accelerators/specialized

* Memory transfer
  - Eliminate redundant transfers
  - Software VSM (Virtual Shared Memory)
The StarPU runtime system

[Figure: layers, top to bottom: HPC applications; execution model; scheduling engine; specific drivers; CPUs, GPUs, SPUs]

Mastering CPUs, GPUs, SPUs ... *PUs -> StarPU


The StarPU runtime system 


The need for runtime systems 


* "Do dynamically what can't be done statically anymore"

* Compilers and libraries generate (graphs of) tasks
  - Additional information is ...

* StarPU provides
  - Task scheduling
  - Memory management

[Figure: HPC applications on top of parallel compilers and parallel libraries, above drivers (CUDA, OpenCL), CPU and GPU]


Data management 


* StarPU provides a Virtual Shared Memory (VSM) subsystem
  - Replication
  - Weak consistency
  - Single writer
  - High level API
    - Input & output of tasks = reference to VSM data
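To make "replication" and "single writer" concrete, here is a toy model of per-node replica states in which a transfer is counted only when the requesting memory node holds no valid copy. It is an illustrative sketch, not StarPU's implementation: `toy_handle`, the three-state protocol and the node numbering (node 0 as host RAM, the others as accelerator memories) are all assumptions.

```c
#include <assert.h>

#define NUM_NODES 3  /* assumed: node 0 = host RAM, nodes 1..2 = accelerator memories */

enum replica_state { REPLICA_INVALID, REPLICA_SHARED, REPLICA_OWNER };

struct toy_handle {
    enum replica_state state[NUM_NODES];
    int transfers;  /* number of copies actually performed */
};

/* Registers a piece of data whose only valid copy lives on `home_node`. */
void toy_register(struct toy_handle *h, int home_node)
{
    assert(home_node >= 0 && home_node < NUM_NODES);
    for (int n = 0; n < NUM_NODES; n++)
        h->state[n] = REPLICA_INVALID;
    h->state[home_node] = REPLICA_OWNER;
    h->transfers = 0;
}

/* Read access: fetch a replica only if this node has no valid one. */
void toy_acquire_read(struct toy_handle *h, int node)
{
    assert(node >= 0 && node < NUM_NODES);
    if (h->state[node] != REPLICA_INVALID)
        return;  /* redundant transfer eliminated */

    h->transfers++;
    for (int n = 0; n < NUM_NODES; n++)
        if (h->state[n] == REPLICA_OWNER)
            h->state[n] = REPLICA_SHARED;
    h->state[node] = REPLICA_SHARED;
}

/* Read-write access: this node becomes the single writer. */
void toy_acquire_readwrite(struct toy_handle *h, int node)
{
    assert(node >= 0 && node < NUM_NODES);
    if (h->state[node] == REPLICA_INVALID)
        h->transfers++;  /* need the current contents first */

    for (int n = 0; n < NUM_NODES; n++)
        h->state[n] = REPLICA_INVALID;
    h->state[node] = REPLICA_OWNER;
}
```

Two consecutive reads from the same accelerator cost one transfer in this model, while a write on another node invalidates every other replica so that later readers fetch the new version.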


The StarPU runtime system 


Task scheduling 


* Tasks =
  - Data input & output
    - Reference to VSM data
  - Multiple implementations
    - E.g. CUDA + CPU implementation
  - Non-preemptible
  - Dependencies with other tasks
  - Scheduling hints

* StarPU provides an Open Scheduling platform
  - Scheduling algorithm = plug-ins
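The "scheduling algorithm = plug-in" idea can be sketched as a function pointer that the runtime calls whenever a task becomes ready. Everything below is hypothetical (the `worker` structure, the two policies and `push_task` are illustration names, not StarPU's scheduler interface); it only shows how two policies can disagree on the same input.

```c
#include <assert.h>

/* Scheduler's view of one worker for the task being placed (assumed fields). */
struct worker {
    double ready_at;   /* time at which the worker becomes idle */
    double exec_time;  /* predicted run time of the task on this worker */
};

/* A policy plug-in: returns the index of the chosen worker. */
typedef int (*sched_policy)(const struct worker *w, int n);

/* Picks the worker that becomes idle first, ignoring how fast it is. */
int policy_first_idle(const struct worker *w, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (w[i].ready_at < w[best].ready_at)
            best = i;
    return best;
}

/* Picks the worker on which the task is predicted to finish earliest. */
int policy_min_completion(const struct worker *w, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (w[i].ready_at + w[i].exec_time <
            w[best].ready_at + w[best].exec_time)
            best = i;
    return best;
}

/* Places one task with the given policy and books the worker's time. */
int push_task(struct worker *w, int n, sched_policy policy)
{
    assert(n > 0);
    int chosen = policy(w, n);
    w[chosen].ready_at += w[chosen].exec_time;
    return chosen;
}
```

With an idle CPU that needs 10 time units and a GPU that is busy for 2 more units but needs only 1, the first policy picks the CPU and the second waits for the GPU, which is the kind of trade-off a pluggable scheduler lets researchers explore.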


The StarPU runtime system 
Task scheduling 


* Who generates the code?
  - StarPU task ~= function pointers
  - StarPU doesn't generate code

* Libraries era
  - PLASMA + MAGMA
  - FFTW + CUFFT...

* Rely on compilers
  - PGI accelerators
  - CAPS HMPP...
Task management

Implicit task dependencies

* Right-Looking Cholesky decomposition (from PLASMA)

    for (k = 0 .. tiles - 1)
    {
        POTRF(A[k,k])
        for (m = k+1 .. tiles - 1)
            TRSM(A[k,k], A[m,k])
        for (m = k+1 .. tiles - 1)
            SYRK(A[m,k], A[m,m])
        for (m = k+1 .. tiles - 1)
            for (n = k+1 .. m - 1)
                GEMM(A[m,k], A[n,k], A[m,n])
    }


The StarPU runtime system 


Development context 


* History
  - Started about 6 years ago
    - PhD Thesis of Cédric Augonnet
  - StarPU main core ~ 40k lines of code
  - Written in C

* Open Source
  - Released under LGPL
  - Sources freely available
    - svn repository and nightly tarballs
    - See http://runtime.bordeaux.inria.fr/StarPU/
  - Open to external contributors

* [HPPC'08]

* [Europar'09], [CCPE'11], ... 2400 citations


The StarPU runtime system

Execution model

* Submit task "A += B"
* Schedule task
* Fetch data
* Offload computation
* Notify termination

[Figure: animation of the steps above over GPU and CPU #k workers]

Optimizations 

* Task pipelining
* Task execution / data transfer overlap
* GPU-GPU copies
* Data prefetch

Thus needs

* Asynchronous API with fine-grain synchronization
* Non-blocking API
* Pitched 2D copy & such
* Thread safety


Host memory mapped on GPU & vice-versa is useful, too 


Programming interface 


Scaling a vector

Data registration

* Register a piece of data to StarPU

    float array[NX];
    for (unsigned i = 0; i < NX; i++)
        array[i] = 1.0f;

    starpu_data_handle vector_handle;
    starpu_vector_data_register(&vector_handle, 0,
        (uintptr_t)array, NX, sizeof(array[0]));

* Unregister data

    starpu_data_unregister(vector_handle);


Scaling a vector 
Defining a codelet

* Codelet = multi-versioned kernel
  - Function pointers to the different kernels
  - Number of data parameters managed by StarPU

    starpu_codelet scal_cl = {
        .cpu_func = scal_cpu_func,
        .cuda_func = scal_cuda_func,
        .opencl_func = scal_opencl_func,
        .nbuffers = 1,
        .modes = { STARPU_RW }
    };
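The multi-versioned kernel idea can be illustrated without the runtime itself. The following plain-C sketch is not the StarPU API: `toy_codelet`, `worker_kind`, `scal_cpu` and `toy_codelet_run` are hypothetical names chosen to show one table of function pointers per kind of processing unit, with a missing entry meaning "no implementation for that unit".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical worker kinds; not StarPU's own enumeration. */
enum worker_kind { WORKER_CPU, WORKER_ACCEL, WORKER_KINDS };

typedef void (*kernel_fn)(float *vec, size_t n, float factor);

/* A toy codelet: one kernel per worker kind, NULL when unavailable. */
struct toy_codelet {
    kernel_fn impl[WORKER_KINDS];
};

/* CPU version of the vector scaling kernel. */
void scal_cpu(float *vec, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        vec[i] *= factor;
}

/* Runs the version matching `kind`.
 * Returns 0 on success, -1 if the codelet has no such version. */
int toy_codelet_run(const struct toy_codelet *cl, enum worker_kind kind,
                    float *vec, size_t n, float factor)
{
    assert(cl != NULL && kind < WORKER_KINDS);
    if (cl->impl[kind] == NULL)
        return -1;
    cl->impl[kind](vec, n, factor);
    return 0;
}
```

A codelet initialized with `{ .impl = { [WORKER_CPU] = scal_cpu } }` can then only be run on CPU workers; a scheduler would use the same table to decide which workers are eligible for a task.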


Scaling a vector 


Defining a task 


* Define a task that scales the vector by a constant

    struct starpu_task *task = starpu_task_create();
    task->cl = &scal_cl;

    task->buffers[0].handle = vector_handle;

    float factor = 3.14;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
    starpu_task_wait(task);


Scaling a vector 


Defining a task, starpu_insert_task helper

* Define a task that scales the vector by a constant

    float factor = 3.14;

    starpu_insert_task(
        &scal_cl,
        STARPU_RW, vector_handle,
        STARPU_VALUE, &factor, sizeof(factor),
        0);
Scaling a vector

Defining a task, gcc plugin

    void scale_vector(int size, float vector[size], float factor)
        __attribute__((task));
    void scale_vector_cpu(int size, float vector[size], float factor)
        __attribute__((task_implementation("cpu", scale_vector)));
    void scale_vector_cpu(int size, float vector[size], float factor)
    { ... }
    int main(void) {
        static float input[NX];
    #pragma starpu register input
        scale_vector(NX, input, 42);
        frob_vector(NX, input, out1);
        shred_vector(NX, input, out2);

    #pragma starpu wait

    #pragma starpu unregister input
    }
Scaling a vector

Defining a task, gcc plugin

    void scale_vector(int size, float vector[size], float factor)
        __attribute__((task));
    void scale_vector_opencl(int size, float vector[size], float factor)
        __attribute__((task_implementation("opencl", scale_vector)));
    #pragma starpu opencl scale_vector_opencl \
        "vector-scale.cl" "vector_scal_kern" \
        group_size ngroups
Mixing PLASMA and MAGMA with StarPU

* QR decomposition
  - Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060)

[Figure: speed (in Gflop/s) versus matrix order (0 to 40000). Annotation: +12 CPUs give ~200 Gflops vs measured 150 Gflops, thanks to heterogeneity]
Mixing PLASMA and MAGMA with StarPU

* "Super-linear" efficiency in QR?
  - Kernel efficiency
    - sgeqrt: CPU 9 Gflops, GPU 30 Gflops (speedup ~3)
    - stsqrt: CPU 12 Gflops, GPU 37 Gflops (speedup ~3)
    - sormqr: CPU 8.5 Gflops, GPU 227 Gflops (speedup ~27)
    - sssmqr: CPU 10 Gflops, GPU 285 Gflops (speedup ~28)
  - Task distribution observed on StarPU
    - sgeqrt: 20% of tasks on GPUs
    - sssmqr: 92.5% of tasks on GPUs
  - Taking advantage of heterogeneity!
    - Only do what you are good for
    - Don't do what you are not good for
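A toy calculation shows how combining units can beat the sum of what each achieves alone. It uses the sgeqrt and sssmqr rates quoted on this slide, but the workload mix (9 Gflop of sgeqrt and 285 Gflop of sssmqr) and the one-CPU, one-GPU setting are assumptions picked to keep the arithmetic simple, not measurements.

```c
#include <assert.h>

/* Aggregate rate (Gflop/s) when a single unit runs kernel A, then kernel B. */
double rate_alone(double gflop_a, double gflop_b,
                  double rate_a, double rate_b)
{
    assert(rate_a > 0.0 && rate_b > 0.0);
    double time = gflop_a / rate_a + gflop_b / rate_b;
    return (gflop_a + gflop_b) / time;
}

/* Aggregate rate when unit X runs all of kernel A while unit Y
 * concurrently runs all of kernel B; the slower side sets the time. */
double rate_split(double gflop_a, double gflop_b,
                  double rate_a_on_x, double rate_b_on_y)
{
    assert(rate_a_on_x > 0.0 && rate_b_on_y > 0.0);
    double time_a = gflop_a / rate_a_on_x;
    double time_b = gflop_b / rate_b_on_y;
    double time = time_a > time_b ? time_a : time_b;
    return (gflop_a + gflop_b) / time;
}
```

Under these assumptions the CPU alone reaches about 10 Gflop/s (`rate_alone(9, 285, 9, 10)`), the GPU alone about 226 Gflop/s (`rate_alone(9, 285, 30, 285)`), and the split where the CPU takes sgeqrt and the GPU takes sssmqr reaches 294 Gflop/s (`rate_split(9, 285, 9, 285)`), more than the two standalone rates added together.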
Conclusion

Summary

Tasks

* Nice programming model
* Runtime playground
* Algorithmic playground
* Used for various computations
  - Cholesky/QR/LU (dense/sparse), FFT, stencil, CG, FMM...

* http://starpu.gforge.inria.fr


Conclusion 


Summary 


Scheduling researchers can experiment and tune various heuristics

* On actual applications
* Without even needing the hardware
* And with fast experimentation time

Optimize

* Completion time
* Memory consumption
* Energy consumption

* http://starpu.gforge.inria.fr